Guidelines to use 8-bit character codes
Version 2. July 1992.
A. Pirard
University of Liege
Belgium
Important preliminary notice
This file contains translation tables between proprietary codes and
ISO codes. As indicated, some translate several characters arbitrarily for
lack of a known definition of this translation by the owner of the code
(constructor). So, watch this space for an update indicating any news as
I get to know it.
Since version 1:
- At the request of SHARE, IBM has:
- defined a new code page 1047 compatible with the de-facto EBCDIC.
- defined a new code page 819 corresponding to ISO 8859-1.
- published a document listing the translation between 819 and the SAA
code pages 850 and 500, from which other translation may be deduced.
- see summary of changes at the top of the paragraph about IBM.
- So, the translation tables between 8859-1 and PC codes have been
changed accordingly.
- The translation of the Macintosh code has been changed to account for 6
ISO characters that appear in an Icelandic Macintosh code and translate
arbitrarily otherwise. This displaced 4 other arbitrary translations.
- IBM code pages 850 and 1047 are considered the preferred tables; other
translations were moved to a secondary file to reduce size.
Changes to the text:
- more complete explanation of keyboard handling for the PC.
- updating explanations to follow evolution of usage and terminology.
- minor revisions for clarity.
Introduction.
In the course of my work in communications in a French-speaking
environment -- writing programs, installing but mostly having to adapt
others -- I discovered facts, notions, techniques and data related to
international characters usage. Many English-speaking programmers are
willing to extend the scope of their software to what is for them
"foreign languages". Discussion with them is often lengthy to convey
numerous details that are obvious to one and obscure to the other. Trying
to help without repeating the same words all over again is the reason for
this document.
This text is restricted to the problem of the character codes used
in data. Yet, I should mention briefly that isolating the user interface
messages from executable code is a real plus. These messages should be
easily translatable by anyone who knows the language, even if the source
is unavailable. Anything similar to the Macintosh resources is ideal.
Lest this goal seem too easy, I must warn that phrases in many
languages are longer than in English and that the order of inserts may
vary depending on grammar.
I am much indebted to the people I met on networks and on the
mailing list ISO8859@JHUVM for their discussion (especially Edwin Hart
HART@APLVM, with his SHARE White Paper to IBM). The international
community owes much to the Kermit developers group led by Christine
Gianone and fed by Frank da Cruz and many volunteers who produced several
Kermit versions using the principles described in this document and store
character codes related data on WATSUN.CC.COLUMBIA.EDU:kermit/charsets. I
should also thank many other people for their interest, especially those
who adapted their programs, but I am truly unable to mention them all.
You will know when some ISO 8859 setting catches your eye.
At the risk of a lack of justification, I have made every effort to
keep this text as concise as possible to spare your time. One will have
to think beyond the text in some places. On the other hand, please excuse
me if some paragraphs state the obvious: it is sometimes needed. Also
remember that English is not my mother tongue...
A language among others: French.
Like many other languages, French uses characters not found in
English. It likes to adorn them with diacritics (accents). Other
languages use other characters, from a few like German to totally
different like Russian and Greek, or even the right to left Arabic and
Hebrew. To the question: "could you do without them?", I like to reply
that forgetting them in "a la francaise" makes it mean "has the French
girl". "a" must take a grave accent to distinguish the preposition from
the verb and "c" takes a cedilla. French without diacritics is certainly
not unreadable, rarely ambiguous with the aid of context (i.e. to humans
but not to computers), but just as unpleasant as all-uppercase text and
difficult to read: one stumbles on most missing accents, as when
proofreading a child's dictation. In the general case, many languages
cannot do without their own characters anyway.
Terms.
A "character" is what one writes down on paper. A "code" is a
computer representation of a set of characters that we can see as
associated with numbers called "code points". A code usually includes
"control characters" for which a graphic representation does not normally
exist, because they are only used to control the operation of hardware or
have special meanings to programs.
7-bit character codes
ASCII (ANSI X3.4) was defined as a 7-bit code for English at a time
when hardware was really hard and expensive. To allow the use of some of
those particular characters that other languages need, it was later
decided that a defined subset (the least used ones) could be replaced.
This is ISO 646. Several languages had the subset replaced with their own
characters. This is what can be done with the Escape sequences of Epson
printers to switch to a national language. ANSI X3.4 became an instance
of ISO 646. But, for some languages like French, the number of characters
that can be replaced is not enough, and text processing of those days made
extensive use of backspaces and overstrikes for the missing ones. On the
other hand, replacing programming symbols with national characters
introduces much confusion in programming languages, like a comment being
terminated by its own text, and in several uses of those characters (e.g.
in e-mail or Unix) where the national meaning clashes with the ASCII one.
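The clash can be illustrated with a minimal Python sketch. The table below is an assumption on my part: it lists the code points that the French national variant of ISO 646 is commonly documented as replacing, and the function names are invented for the illustration.

```python
# Assumed replacements of the French national variant of ISO 646:
# same 7-bit code point, different graphic than US ASCII.
FRENCH_646 = {
    0x23: "£", 0x40: "à", 0x5B: "°", 0x5C: "ç", 0x5D: "§",
    0x60: "µ", 0x7B: "é", 0x7C: "ù", 0x7D: "è", 0x7E: "¨",
}

def as_seen_in_french_646(data: bytes) -> str:
    """Show 7-bit bytes the way a French ISO 646 device would print them."""
    return "".join(FRENCH_646.get(b, chr(b)) for b in data)

def french_to_646(text: str) -> bytes:
    """Encode French text to 7-bit bytes using the national replacements."""
    reverse = {glyph: code for code, glyph in FRENCH_646.items()}
    return bytes(reverse.get(ch, ord(ch)) for ch in text)

# A line of C as it appears on the French device.
print(as_seen_in_french_646(b"a[i] = '\\0';"))

# A Pascal-style comment typed in French: the e-grave is stored as the
# code point that an ASCII compiler reads as the closing brace.
print(french_to_646("{ voir après }"))
```

The second call shows the comment being terminated by its own text: the stored bytes contain a closing brace in the middle of the word.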
US EBCDIC (an IBM code) used more or less the same characters as
ASCII, but used different code points. I should say "more and less". Some
ASCII characters did not exist in EBCDIC (e. g. square brackets) and
EBCDIC had ones (cent sign, not sign) that were not in ASCII. As a
consequence, the translation between ASCII and EBCDIC was strictly
speaking undefined, and IBM never officially defined a complete one.
Users defined one translation which resulted in a so-called de-facto
EBCDIC containing all the characters of ASCII, that all ASCII-related
programs use. Although EBCDIC was an 8-bit code "with holes", IBM made the
same character replacements as ISO 646 in hardware to be used with other
languages (but, again, as other characters were missing, this was of
little use to French).
Even though data was stored in octets, 7-bit communication lines were
used and it was (and still is) common practice for software to strip off
the 8th bit despite a possible extension of the code, future or existing.
We lived a long time of computer frustration. Is the problem solved?
8-bit character codes
Storing in a database text full of "this backspace that", trying to
sort it etc... or getting a Sterling pound bill paid in dollars because
that's what the dollar sign is replaced with in the English version of
ISO 646 was a real pain and an insult to the octet. It was soon realized
that, even if text processing could cope to some extent with compound
characters, data processing could not at all. One character must be one
data element of constant width.
With the era of cheaper hardware and microcomputers, manufacturers
started to use the upper half of the 256 code points of the common 8-bit
byte for international characters. It was one major reason for the success
of these computers on the international market.
But there was no standard and each did it his own way as to which
characters and which code points to use, like to-day's DEC, Apple, Atari,
Commodore or other less known brands. The IBM PC was built with yet
another code that was later called "code page 437" and that everyone in
the compatible business settled on. But IBM also built PCs with
variations for countries using characters that were not in 437, now
called 860, 863 and 865.
There was an evident Babel and a new standard had to be set.
National institutions and many constructors participated to produce the
ISO 8859 standard. As 256 code points are not enough for all languages in
the world, several "versions" of this standard exist (see below for a
list, still evolving). ISO 8859-1 is for group 1 of Latin-based languages
and covers Western Europe, including English, hence many major countries
in North and South America, Australia and many others world wide.
A new multibyte standard is being prepared: ISO 10646 -- in which
ISO 8859-1 is a contiguous subset --, that will cover all languages in a
single code. "Unicode" -- a code being defined by a consortium of
manufacturers -- and ISO 10646 joined: Unicode will be a 2-byte subset of
4-byte ISO 10646, with the remarkable result of a single worldwide code.
Until ISO 10646 can be used, today's hardware and software, strongly
single-byte oriented, can easily extend the scope of a character code to
8 bits and one version of ISO 8859. That the particular version used is
implicit to a group of languages is regrettable indeed, but it must be
understood that it is a dramatic improvement in a country or a group of
countries where the language of the data is implicit anyway.
For short, I may call "ISO 8859" or simply "ISO" in the following
text any version that a system uses at any one time, when assuming that
the systems do not switch versions dynamically, but that the user can
set up the choice of the version he uses, if not implied by hardware.
ISO 8859 (any version) is an extension of ASCII. The upper half (in
fact, 128-159 are reserved for more control characters) is filled with
characters for a group of countries. The present trend to use ISO 8859 is
certain. Version 1 is much like the previous DEC's "8-bit ASCII code",
and VT terminals now have a setup to use 8 bits and ISO 8859 (and Escape
sequences to switch among and display several ISO 8859 versions). Looking
at Microsoft and Lotus international codes, one notices that they had
soon adopted a "pre-release" of ISO 8859-1 (Microsoft calls ISO 8859-1
"ANSI code" in their documentation of Windows). As explained below, IBM
have adopted ISO 8859-1 their own way. X-Windows specifications (from
MIT, of a presentation system on a remote graphic terminal) prescribe
that ISO 8859-1 is to be used on the communication line. By mutual
agreement, a growing number of universities and institutions exchange
data in ISO.
ISO 8859-1, Latin Alphabet 1, for Danish, Dutch, English, Faeroese,
Finnish, French, German, Icelandic, Irish, Italian, Norwegian,
Portuguese, Spanish, and Swedish.
ISO 8859-2, Latin Alphabet 2, for Albanian, Czech, English, German,
Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, and Slovene.
ISO 8859-3, Latin Alphabet 3, for Afrikaans, Catalan, English, Esperanto,
French, Galician, German, Italian, Maltese, and Turkish.
ISO 8859-4, Latin Alphabet 4, for Danish, English, Estonian, Finnish,
German, Greenlandic, Lappish, Latvian, Lithuanian, Norwegian, and
Swedish.
ISO 8859-5, the Latin/Cyrillic Alphabet, for Bulgarian, Byelorussian,
Macedonian, Russian, Serbo-Croatian, and Ukrainian.
ISO 8859-6, the Latin/Arabic Alphabet.
ISO 8859-7, the Latin/Greek Alphabet.
ISO 8859-8, the Latin/Hebrew Alphabet.
ISO 8859-9, Latin Alphabet 5, for Danish, Dutch, English, Faeroese,
Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish,
Swedish, and Turkish.
The "foreign" environment.
So, these facts of languages make our typewriters different, and the
computer keyboards are modelled after them. A few letters moved about,
digits on the uppercase side, accented letters in place of programming
symbols etc... More striking, if you pardon the pun, is that -- because
the number of keys is not enough for all the French characters -- some
so-called dead-keys are used to compose accented letters: one strikes the
dead-key, then another letter, and the program receives a single code
point as input, just as a typewriter could overtype.
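As a rough illustration of dead-key handling, here is a minimal Python sketch. It is not how any particular keyboard driver works: the DEAD_KEYS table and the compose function are invented for the example, and Unicode combining marks stand in for the driver's own composition tables.

```python
import unicodedata

# Hypothetical dead keys, each mapped to the matching combining mark.
DEAD_KEYS = {"^": "\u0302", "`": "\u0300", "\u00a8": "\u0308"}

def compose(strokes):
    """Turn key strokes into characters, resolving dead keys."""
    out = []
    pending = None  # dead key waiting for its base letter
    for key in strokes:
        if pending is None:
            if key in DEAD_KEYS:
                pending = key          # nothing is output yet
            else:
                out.append(key)
            continue
        combined = unicodedata.normalize("NFC", key + DEAD_KEYS[pending])
        if len(combined) == 1:
            out.append(combined)       # one accented character
        else:
            out.extend([pending, key]) # no such letter: keep both strokes
        pending = None
    if pending is not None:
        out.append(pending)
    return "".join(out)

print(compose(list("h^otel")))                 # six strokes, five characters
print(compose(["^", "e"]).encode("latin-1"))   # a single ISO 8859-1 octet
```

The point of the sketch is that the number of strokes and the number of code points delivered to the program differ, so a one-strike-to-one-code-point assumption fails.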
It must be realized that, to an international computer user, an
8-bit code is just as natural as the 7-bit one is to English-speaking
users. 8-bit code points "come out" of some plain keys of the keyboard and
are expected to display. If a program filters them out, this will be
shocking. If it uses these code points for internal control functions,
the user will be confused by "strange behavior" that a US keyboard would
never exhibit. For example, if it strips the 8th bit of a PC e-grave
(8A hex), it produces a disturbing linefeed (0A hex). Or if a program
decides that normal characters belong to the range 32-127, this will play
havoc. It is worth checking a program with such data, which some
keyboards can produce with alternate input.
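The effect of stripping the 8th bit is easy to check. The following Python sketch uses the "cp437" codec of the Python standard library as a stand-in for the PC code; the function names are mine.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """What a 7-bit-minded program does to every octet."""
    return bytes(b & 0x7F for b in data)

def looks_normal(b: int) -> bool:
    """The naive test for a 'normal' character criticized in the text."""
    return 32 <= b <= 127

pc_text = "très".encode("cp437")       # e-grave is 8A hex in code page 437
print(pc_text)
print(strip_eighth_bit(pc_text))       # the accented letter became a linefeed

# Other accented letters fall on other control characters.
print(strip_eighth_bit("é".encode("cp437")))

# The naive range test rejects every accented letter of the PC code.
print([looks_normal(b) for b in pc_text])
```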
Trust little about the keyboard layout and physical scan-codes. The
only reliable input is through the operating system or country-
configurable keyboard driver interface. Working with physical input is
trying to duplicate the varying and sometimes complicated logic of those
drivers (maybe covering several keyboards per country) and heading for
problems or incomplete coverage. Assuming that one can use transformation
of one strike to one code point is incorrect, because of the dead-keys.
Using the state of special keys of the PC (Shift, Ctrl, Alt etc...) to
try to modify the meaning of what the system outputs (a usual feature of
communication programs) is not the best idea either, because keyboard
recorders rarely replay the shift states along with that output. And, in
general, mixing input from different levels is unsafe: strictly speaking,
these states are asynchronous with the input, one may read a key code
when the shift state has disappeared. Yes, a program is usually faster
than the user, but can one swear that a fast, long buffered auto repeat
makes this true in all cases? Imagine your output being blocked by
network flow control... Oh yes, it can happen.
As an example, here is what can be done on that PC I know well. The
keyboard driver outputs 2 bytes, H and L.
When H is nonzero, it is the physical position of the key pressed;
so, unless the documentation really wishes to refer to the key by
position and not keycap for such things as diamond-shaped or in-a-row key
groups, ignore this value and simply use L as final data: it is extended
ASCII to be used as such (or, at most to go through a code translation as
discussed hereafter). Note that different keys (different H) may produce
the same code-point (L); e.g. L is 0 for an Alt/literal-number of the PC.
When H is zero, a special key combination has been pressed,
indicated by the value of L, to be used to index a table of actions of
the program. The PC defines 166 such special key combinations (0+L) and
the intention of the application designer -- when using modifying shift
states -- is to provide more, or, also, those the user really wants. The
90 remaining values of L are probably enough for additional definitions
(but some or all of the 166 could even be redefined, or even "impossible"
values of H could be used, each one multiplying the choice by the 256
values of L).
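The dispatch described above can be sketched in a few lines of Python. The sketch follows the H and L convention of this text literally; the ACTIONS table and its two entries are hypothetical, chosen only for the illustration.

```python
# Hypothetical table of program actions, indexed by L when H is zero.
ACTIONS = {59: "help", 68: "quit"}

def handle_key(h: int, l: int):
    """Classify one 2-byte keyboard driver output (H, L)."""
    if h != 0:
        # Ordinary key: ignore the physical position H, use L as data.
        return ("data", l)
    # Special key combination: L selects an action of the program.
    return ("action", ACTIONS.get(l, "ignored"))

print(handle_key(0x12, 0x82))   # data: code point 82 hex, whatever the key
print(handle_key(0, 59))        # action looked up in the table
print(handle_key(0, 200))       # unassigned pseudo-scan-code
```

Note that the data branch never looks at the keyboard layout: the code point L is all the program needs.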
Hence, the simplest method is to assign pre-defined additional
pseudo-scan-codes (0+L) -- and, repeat, certainly not extended ASCII code
points -- to the actions of the program and to manage to have the
keyboard driver produce them on any key the user chooses. Here is how to
do that.
Each time a key is pressed (or released), the keyboard driver -- be
it in ROM or the keyb... driver -- calls software interrupt 15h with 4Fh
in register AH, with carry flag set and the physical scan code of the key
in register AL (ORed with 80h when released, so that this case can easily
be ignored). The application may intercept interrupt 15h and test for a
key it wants together with the shift key states it wants (safe at this
level).
- If AH is not 4Fh or AL is an unwanted key, the processor's flags and
registers are left as on program entry (with carry set), and control is
transferred to the next interrupt 15h handler (as any well-behaved
interceptor must do); eventually, the keystroke will be used or ignored
by someone else. Usually, this transfers to a dummy interrupt and returns
to the keyboard driver to use the keystroke in the normal way.
- If the key is wanted, interrupt 16h is called with AH=5 and CH+CL set to
what is to be placed in the keyboard buffer queue to be the input to the
application. Then, return is made to the caller with carry flag cleared
to indicate to the keyboard driver that the keystroke is used and that it
is to ignore it and clean up the hardware interrupt.
- One can insert anything in the keyboard buffer, extended ASCII (PC
code) or pseudo-scan-code: a keyboard recorder will receive that and
replay it faithfully (but, of course, your inventions will be meaningful
only to your own application). This is the way to even produce "Enter"
with the right Ctrl key as IBM 3270 emulations do (if you really insist,
I personally hate this). However, remember that inserting extended ASCII
may be in conflict with the choice of a particular keyboard or code page
for which it is different: again, the keyboard driver knows much better
about that.
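The decision logic of such an interceptor can be modelled in Python. This is only a model of the steps listed above, not real-mode code: the list stands in for the keyboard buffer that interrupt 16h with AH=5 would fill, chaining to the next handler is reduced to returning with carry set, and the scan code and queued values in the usage lines are hypothetical.

```python
KEY_RELEASED = 0x80   # bit ORed into AL when the key is released

def keyboard_intercept(ah, al, wanted, buffer):
    """Return the carry flag handed back to the keyboard driver.

    wanted maps a physical scan code to the (CH, CL) pair to queue.
    buffer is a list standing in for the keyboard buffer queue.
    """
    if ah != 0x4F or (al & KEY_RELEASED) or al not in wanted:
        # Not ours: leave everything untouched, carry stays set, and
        # the keystroke is processed in the normal way.
        return True
    buffer.append(wanted[al])   # stands for interrupt 16h, AH=5
    return False                # carry cleared: keystroke consumed

queue = []
wanted = {0x1D: (0, 200)}       # hypothetical key and pseudo-scan-code
print(keyboard_intercept(0x4F, 0x1D, wanted, queue), queue)
print(keyboard_intercept(0x4F, 0x9D, wanted, queue), queue)  # key release
```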
I am no specialist of the Macintosh internals, but I guess there's a
similar story to tell for it.
8-bit codes in communications.
We now realize that exchanging data between those computers with
proprietary 8-bit codes is to international users exactly like sending
data from an ASCII machine to an EBCDIC one: translation has to occur
somewhere.
Who is to translate what to what? Communication, if it is to work at
all, relies heavily on strict standards. If communication between EBCDIC
and ASCII computers is feasible, it is because of the well known fact --
so well known that one often forgets to state it -- that characters on a
communication line must be ASCII. Just imagine there were no such rule.
Then realize that there is no clearly stated equivalent for
international characters, just tacit agreement.
It is urgent to stop all sorts of hacking. I know of at
least 25 different codes with characters similar to ISO 8859-1 that a
file receiver would have to try to detect and know if there were no rule.
This makes at least 600 translation tables (25 x 24, one per ordered pair
of codes). This text advocates standard
communication and simplicity with one code on a given computer.
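A quick count shows how fast the number of tables grows without a rule, and how little is needed with one common code on the line. The Python sketch below only does the arithmetic; the function names and the sample values of n are mine.

```python
def pairwise_tables(n: int) -> int:
    """Tables needed if each of n codes translates directly to each other."""
    return n * (n - 1)

def pivot_tables(n: int) -> int:
    """Tables needed if each code translates only to and from one standard."""
    return 2 * n

for n in (10, 25, 35):
    print(n, pairwise_tables(n), pivot_tables(n))
```

With a single standard on the line, each system carries only its own two tables and needs no knowledge of the other party's code.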
The only solution is to state that each and every octet of text data
carried on a communication line cannot be anything else than an official
standard and that, while waiting for a single multi-octet standard, each
language uses only one standard. ISO 8859 fills this purpose and is the
only official standard. It is already used by major firms and some
protocols like X-Windows.
Conclusions.
A) An "8-bit clean" computer is one allowing characters to have the 8th
bit set. If such a computer (like more and more Unixes of these days) is
to choose a code, the obvious, painless one to avoid any translation is
the standard: a version of ISO 8859. Note that such a machine becomes
code-dependent only by 1) the system messages in the user's language and
2) the terminals and other peripherals used to display and enter the data
(hence, other messages). It might seem that owning a uniform environment
of PCs or Macs and their printers could make their code the best choice
for a nearby Unix machine. In the long run, this will cause problems when
that environment is integrated in networking with other sites. And
internetworking is moving fast and spreads standards. Better start right
than have a computerful of data to translate one day. By the way, note
that most terminal emulators already use ISO 8859-1.
B) If a computer is forced to continue using a code different from but
with a character set similar to a version of ISO 8859, it must behave
with regard to what it sends on and receives from communication lines as
if it were using that version of ISO. This means that the key feature of
protocols (like file transfer in text mode or electronic mail) is to
implement translation of the data that this protocol exchanges with the
communication line. This applies to both services provided by a host and
terminal (client) functions provided by stations. In normal usage, this
translation is expected to always be to ISO 8859, but, to ease the
transition period, the translation may be selectable, especially to
revert to the compatible case of null translation. However, the user
should be advised that the preferred translation is to ISO (and that it
in no way impairs communication restricted to ASCII).
In such a case, a requirement is to define a "best fit" translation
between the proprietary code and that ISO version for text file transfer.
Characters identical in both sets produce a meaningful code point
translation; the translation of other characters is arbitrary but must be
well defined. The important point is that this translation must be one to
one and invertible for all the 256 characters (that is, each character
translates to a different one and the reverse translation returns the
original value). The translation of the lower half of an extension of
ASCII is null. This kind of translation is valuable even if translating
characters to totally different ones in operations like file transfer,
instead of trying to obtain look-alike or multiple ones. The reason is
that doing otherwise may permanently corrupt data that cannot be fully
processed later, be it just to return or forward it. It is better to
obtain partially meaningless data (in appearance) and to be able to
process it locally (e.g. print it more meaningfully) than to assume that
the goal of network transfer is final usage. Note that even if a system
does not use a subset of the code points, it may have to receive files
from systems that do.
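The requirement of a one-to-one, invertible table can be made concrete with a minimal Python sketch. It uses the "cp850" and "latin-1" codecs of the Python standard library as stand-ins for the proprietary code and ISO 8859-1. The arbitrary part of the assignment (leftover characters handed to the unused ISO code points in ascending order) is my own choice for the illustration and is not the translation published by any constructor.

```python
def build_table(local="cp850", iso="latin-1") -> bytes:
    """Build a 256-entry one-to-one table from the local code to ISO."""
    table = {}
    leftovers = []                    # local code points with no ISO match
    for b in range(256):
        ch = bytes([b]).decode(local)
        try:
            table[b] = ch.encode(iso)[0]   # same character in both codes
        except UnicodeEncodeError:
            leftovers.append(b)
    used = set(table.values())
    free = [t for t in range(256) if t not in used]
    for b, t in zip(leftovers, free):      # arbitrary but well defined
        table[b] = t
    return bytes(table[b] for b in range(256))

TO_ISO = build_table()
FROM_ISO = bytes(TO_ISO.index(t) for t in range(256))   # reverse table

# Invertible: every one of the 256 values survives a round trip.
everything = bytes(range(256))
assert everything.translate(TO_ISO).translate(FROM_ISO) == everything

# The lower half (ASCII) is a null translation.
assert TO_ISO[:128] == bytes(range(128))

print(hex(TO_ISO[0x82]))   # e-acute of the PC code, translated to ISO
```

Because the table is a permutation of the 256 values, a file relayed through a system that cannot display some characters still comes back intact.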
A main difficulty is that this translation should be unique for a
given system, so that two computers running this system be able to
exchange data of their own code under the above rules (translation to
ISO) without data loss. It is clear that a proprietary communication
protocol (like NETBIOS) can use the proprietary code without translation.
(Yet, one day, that protocol may well extend to other
computers, causing difficulty.) But, in internetworking, and especially
with electronic mail, a computer should not be expected to
know the type of machine (hence code) of the other party.
The constructor (the owner of the proprietary code) should define
this translation precisely but sometimes fails to do so. In consequence,
one goal of this document is to suggest one as widely as possible.
Terminal emulation deserves a special discussion. For communication
programs (usually providing VT100 terminal emulation), it is not
necessary to provide the full features of the higher VT models that can
switch character codes to achieve international characters support.
Moreover, it is not desirable to ask that the hosts a terminal is
connected to have to send character codes switching escape sequences in
order to initiate the use of national characters. What is needed is just
to be able to set up terminal mode with an initial state of what the GR
code points (values above 127) display. This way, using ISO 8859 will only
be a "matter of fact" to the 8-bit-clean host and neither has to know
about code switching. This is especially true when the only possible
display a microcomputer can achieve is by translating ISO from the line
to its own similar character set, like the IBM PC or an Apple Macintosh
with standard fonts. In short, VT100 emulation is sufficient, but with
added translation before display and from the keyboard.
Now, one important remark about implementing translation with a
proprietary code in a communication program. Two methods are possible.
A) Text is translated at the communication line interface. Hence, the
proprietary code is used for text in computer memory.
B) Text is translated at the other system interface (screen, keyboard,
file). Hence, ISO 8859 is used for text in memory.
The choice of the method depends on a number of factors.
- If the communication protocol is such that all data on the line is
text, method A is the easiest. If there is a mix of text and binary and
a small set of interface points where text can be translated cannot be
found, then method B should be considered.
- If the system interfaces can be well localized (e. g. routines in the
program to interface the screen, keyboard and files of the PC), method B
is easy. Else (e. g. the Macintosh where multiple system interfaces exist
with text as parameters) method A may be better (unless maybe, on the
Mac, ISO fonts were used just for this reason, not very practical except
for a terminal emulation program).
- If the proprietary code is not unique (like the several in use on the PC),
method B is best unless an interface is built to translate the internal
program messages to the current code.
- Using ISO in memory makes the program messages more portable.
Two typical examples: a terminal emulation with file transfer
(Kermit style) on a PC used method B with advantages; a file transfer
program (TCP/IP FTP) on a Mac used method A with great simplicity (e. g.
the filenames in the FTP dialog were translated all together, whereas
method B would have required acting at various points of the Macintosh
API).
Moral.
I can hear those having read this far say they did not suspect such
problems. You will now understand why it is important to write 8-bit
clean software, to use a single code on one computer, why by far the
most interesting one today is ISO 8859 (the Unix advice) and why
applications running on systems that cannot be converted have to
translate text.
IBM and ISO 8859-1 (general, see details before the IBM tables)
For the PC, IBM has now adopted the character set of ISO 8859-1 with
a different code. This was done by replacing some characters of the
original PC code, now called code page 437, to obtain the full character
set of ISO 8859-1. This new code is called "code page 850" and IBM sees
it as the preferred code page for all Latin1 customers (it's their
default code for OS/2). See the appendix D of the "DOS reference manual"
for a description of 850 and the code pages it may replace: 437, 860, 863
and 865. Beware, the yen, cent, and two paragraph symbols that existed in
437 were moved in 850. When one builds a translation table between 850
and ISO 8859-1, 32 characters of 850, mainly box-drawing, are left to be
assigned to the 32 control characters 80-9F of ISO.
For the EBCDIC mainframes, IBM decided that, because terminals were
already using the ISO-646-like replacements to the US EBCDIC, they had to
stay compatible. They extended each such "national EBCDIC" to "country
extended code pages". Thus, there are as many EBCDICs as versions of ISO
646 (what ISO 8859 is trying to avoid). None of the original CECPs was
compatible with the de-facto EBCDIC. Lately, IBM defined CECP 1047 which
is compatible with (an extension of) the de-facto US EBCDIC (see
discussion below). In consequence, I consider that CECP 1047 is the most
interesting EBCDIC code to use, because of the compatibility with the
vast software base.
CECP 1047 "internationalized industry standard" (my terms)
CECP 037 for US, Canada-French, Netherlands, Portugal.
CECP 273 for Germany.
CECP 277 for Denmark and Norway.
CECP 278 for Finland and Sweden.
CECP 280 for Italy.
CECP 284 for Latin America and Spain.
CECP 285 for United Kingdom.
CECP 297 for France.
CECP 500 for Belgium, Switzerland-French and Switzerland-German.
Like 850, all these codes contain all the characters of ISO 8859-1.
Only the recent CECP 1047 is compatible with a de-facto standard
EBCDIC, corresponding to a de-facto ASCII/EBCDIC translation, that a huge
number of products settled on long ago, including software from IBM:
- all compilers from IBM or others: C, REXX, PL/I, Pascal, for those
sensitive to the differences in code points,
- File transfer programs like Kermit, PCTERM, and IBM TCP/IP,
- In fact, the whole of IBM TCP/IP,
- Terminal emulation: TTY line mode or 3270 emulation by the 7171,
- ASCII tapes translation,
- Products to translate ASCII to EBCDIC on a mainframe: ARCUTIL ...
- Products that should produce ASCII, but produce EBCDIC because data
goes through EBCDIC/ASCII translation: e. g. SAS output for Tektronix,
- Products that convert this output anyway, because the expected
EBCDIC/ASCII translation does not occur: LINEMODE through the 7171
transparent mode,
- Similarly, TPRINT to print in this transparent mode
- Certainly many other products I don't know of or I forget, because, as
you see, the de-facto EBCDIC snowballs from one use to the other,
- Last but far from least, it's the translation made by most gateways
that relay mail between BITNET and the Internet, i.e. between EBCDIC mail
and ASCII mail. Of special importance is that of the encoding of data
that is to be transmitted by e-mail (UUENCODE, BOO, HQX...): if the
ASCII-EBCDIC-ASCII translation fails to be invertible, decoding fails.
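The danger for encoded data can be demonstrated with a short Python sketch. It uses the "cp037" and "cp500" codecs of the Python standard library as two examples of EBCDIC variants; the relay function is a hypothetical stand-in for a pair of gateways, and the alphabet is the range of printable characters that UUENCODE draws from.

```python
# Characters 20 hex to 60 hex, the range used by UUENCODE output.
UU_ALPHABET = bytes(range(0x20, 0x61)).decode("ascii")

def relay(text: str, to_ebcdic: str, from_ebcdic: str) -> str:
    """ASCII to EBCDIC at one gateway, EBCDIC back to ASCII at another."""
    return text.encode(to_ebcdic).decode(from_ebcdic)

# Same table at both ends: the round trip is invertible.
print(relay(UU_ALPHABET, "cp037", "cp037") == UU_ALPHABET)

# Different EBCDIC variants at each end: some characters change.
damaged = [c for c in UU_ALPHABET if relay(c, "cp037", "cp500") != c]
print(damaged)
```

Any encoded line containing one of the damaged characters will fail to decode at the destination, which is why one agreed translation matters more than which one is chosen.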
The requirement #1 of SHARE is that IBM use a single EBCDIC code for
Latin group 1 and publish it. Using an extension of de-facto EBCDIC is
recommended.
Asynchronous communication
Thanks to the interest of Frank da Cruz and Christine Gianone,
Kermit now defines specifications to support ISO 8859 (and other codes if
needed) on the communication line in terminal and file transfer mode. It
has provision to extend to mixed codes files too.
John Chandler has extended the traditional translation made by his
remarkable IBM mainframe Kermits to the specific choice of any CECP or
the extended de-facto EBCDIC to ISO 8859-1.
The impressive MSDOS Kermit by Joe Doupnik now also supports
translation of PC code pages to ISO8859-1.
Thanks to Paul Placeway, Macintosh Kermit now supports ISO 8859-1 as
an 8-bit line terminal. Others have taken over the job to complete it.
I think I can speak on behalf of the international computing
community and enthusiastically thank these people for work most useful
to it.
TCP/IP
Despite a mention I have read in an introduction to the TCP/IP
communication protocols "provision for hosts with different character
sets", the idea does not extend much into the standards. In fact, some of
them even restrict text to 7-bit explicitly and without more reason than
some points of forgotten history. No attempt is made to make a statement
to standardize what must be an 8-bit code so that it be common to all
machines, just like ASCII is, as explained above.
In practice, it is often no more than a question of implementation:
use ISO 8859 as the code of a machine or translate the proprietary code
to ISO 8859. At the time of writing the first version of this text, just
EBCDIC mainframes did translate, because the need appeared evident; it
was restricted to the US ASCII character set, but a simple table change
extends the scope of all protocols. For international characters users,
the same problem and solution exists for any host not using ISO 8859. As
of this writing, the most important applications on the Macintosh have
applied the principles: Eudora (POP3) by Steve Dorner, Brown tn3270 by
Peter DiCamillo, Fetch (FTP client) by Jim Matthews, FTPd (FTP server)
and other programs by Peter Lewis which cope with translation, as of
today NCSA/BYU/UCL Telnet by Pascal Maes, of course Mac-X from Apple and
even others still to check. IBM PC, status quo: just Telnet by IBM (both
vt100 and tn3270) and several other firms. Thanks to the authors!
The idea of translating the data does not occur to the people who
write TCP/IP applications because they are not aware of the problem. If
the protocol mentions it, the application will probably be written
correctly in that respect. For example, the specifications of X-Windows
state that ISO 8859-1 is the code that must be used to exchange text
between the client and the server of that protocol; and all X-Windows
applications are correct.
So, short of rewriting most RFCs just for this, what is needed is a
general TCP/IP statement saying what single code TCP/IP application
protocols use on communication lines: ISO 8859, with future migration to
ISO 10646. This would be like adding a minimal presentation layer.
Specific TCP/IP cases.
Telnet. Take the most basic VT100 implementation, treat the keyboard
as explained above (translating keyboard input to ISO 8859), translate
ISO to local code before display and you are done. There is no need to
negotiate binary (I am told it even hurts, and to my mind binary has
nothing to do with the fact that the text a particular terminal uses is
8-bit). Note that anyone afraid of the 8th bit can limit his typing to
ASCII; his host will not return anything else and the upgraded program
will behave exactly as before. Also note that ISO 8859 does not conflict
with the 8-bit control characters and that using ISO is a simplification.
There is no need to wonder or negotiate whether the host will send them:
if any byte in the range 80-9F comes in, you may treat it as a control.
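As a rough illustration of the receive path just described, the
following Python sketch classifies each incoming byte and translates only
the graphic ones. The names classify, render and TO_LOCAL are hypothetical
and belong to no Telnet implementation; TO_LOCAL is an identity table
standing in for a real ISO 8859-1 to local code table.

```python
# Sketch only: hypothetical names, identity table used as a placeholder.

TO_LOCAL = bytes(range(256))  # assumption: replace with a real ISO -> local table


def classify(byte):
    """Return the class of one byte received from the host."""
    if byte < 0x20 or byte == 0x7F:
        return "C0"        # ASCII control characters
    if 0x80 <= byte <= 0x9F:
        return "C1"        # 8-bit control characters, never displayed
    return "graphic"       # printable: 20-7E and A0-FF


def render(data, table=TO_LOCAL):
    """Split host data into (kind, bytes) runs; translate graphic runs only."""
    runs = []
    for byte in data:
        kind = classify(byte)
        out = bytes([table[byte]]) if kind == "graphic" else bytes([byte])
        if runs and runs[-1][0] == kind:
            runs[-1] = (kind, runs[-1][1] + out)   # extend the current run
        else:
            runs.append((kind, out))               # start a new run
    return runs
```

Because the decision depends only on the byte value, nothing has to be
negotiated with the host before the 80-9F range is handled as control.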
Tn3270. Like IBM mainframes, it is forced to translate. So, it's
just a matter of using the correct tables. You will save time by not
trying to support all the EBCDIC CECPs. Using CECP 1047 will probably make
everybody happy. However, make the translation customizable. If someone
wants things differently, it will probably be a whole installation that
has the time to customize it.
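One possible way to make the translation customizable is to let an
installation supply its own table as a text file in the same 16 by 16
hexadecimal layout used for the tables in this document. The sketch below
assumes that layout; the function names and the idea of falling back to a
built-in default are illustrative choices, not a description of any
existing tn3270 program.

```python
# Sketch only: parse_table and load_table are hypothetical helper names.

def parse_table(text):
    """Parse 256 two-digit hex values separated by blanks or newlines."""
    fields = text.split()
    if len(fields) != 256:
        raise ValueError("expected 256 entries, got %d" % len(fields))
    # bytes() itself rejects any value outside 00-FF with a ValueError.
    return bytes(int(field, 16) for field in fields)


def load_table(path, default):
    """Return the site's table read from path, or default if no file exists."""
    try:
        with open(path) as handle:
            return parse_table(handle.read())
    except FileNotFoundError:
        return default
```

A malformed file raises an error instead of being silently accepted, so
a site that customizes its table finds mistakes at startup rather than
in garbled screens.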
SMTP. Although RFC 821 restricts data to 7 bits, it works quite well
with 8. We use 8-bit mail on Unix (Sun and IBM), on IBM mainframes and on
Macintosh to the delight of our users. It's just a matter of not crossing
8th-bit-stripper gateways. For the Internet, we do not use such hosts as
less preferred MXes and we expect that sites wanting 8 bits will do the
same. Together with many other sites, we use ISO 8859-1. No problem!
So, that is just what is needed for the Internet: kill 8th bit
killers or don't use them. Other networks should be expected to do the
same with their mail and to use the correct gateways with the Internet.
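To see why an 8th-bit stripper is harmful rather than merely lossy,
consider what it does to ISO 8859-1 text. The following Python sketch
simulates such a gateway; the function names are made up for the
illustration.

```python
# Sketch only: a simulated gateway, not the behaviour of any named product.

def strip_eighth_bit(data):
    """Simulate a gateway that clears the high bit of every byte."""
    return bytes(byte & 0x7F for byte in data)


def is_damaged_by_stripping(data):
    """True if at least one byte of data would be altered by such a gateway."""
    return any(byte & 0x80 for byte in data)


message = b"\xe9t\xe9"               # French "ete" with two e acute, ISO 8859-1
print(strip_eighth_bit(message))      # prints b'iti'
```

The accented letter does not vanish; it silently becomes a different,
valid ASCII letter, which is why the damage often goes unnoticed until a
reader complains.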
The BITNET/Internet gateways, for example, should translate between
ISO 8859-1 and CECP 1047.
The same general rules for translation as explained above for file
transfer apply to FTP and other protocols. Note that the distinction
between text and binary may need to be introduced in additional places.
For example, NFS would benefit from using it (and best at the file level).
General conclusions
1) Every effort should be made so that all operating systems' codes are
unique and universal, i.e. ISO 8859-x for an 8-bit code, while waiting
for the perfect unity of a single multibyte code.
2) Failing that, communication software must compensate for a particular
system's weakness and translate data so that the system appears to the
outside world to use the unique data interchange code.
3) Programmers must deal with 8-bit character codes (and prepare for
multibyte ones).
Translation.
I have been looking for constructor-defined or most widely accepted
complete tables and I explain the reasons for the choices. However, I
cannot guarantee that another translation will not be used someday. The
data correspond to my explanations. That's all I can say.
DEC
Easy case first. DEC uses ISO 8859-1 (just a few characters of their
8-bit code -- pre-dating ISO 8859 -- are different). Nothing to do except
making sure the 8 bits go through.
IBM translations
Since version 1 of this document, IBM has published the following
"Character Data Representation Architecture" (CDRA) documents:
GC09-1392-00 Executive Overview
GC09-1390-00 Level 1, Reference
GC09-1391-00 Level 1, Registry
The latter answers most of the former questions about translation.
IBM has also published a new EBCDIC CECP 1047 that fulfills the
requirements of compatibility with the previous de-facto EBCDIC. However,
IBM has made no statement that I know of about support, nor about whether
this code is intended to be the sole one for Latin-1 languages.
In consequence of the SHARE requirement (the necessity to use a
single compatible code on IBM mainframes), I think, along with many
people, that only CECP 1047 should be used on EBCDIC mainframes. And, by
extension, only CP 850 on the PC (but ISO 8859-1 would be better). The PC
may also use CP 437 (e.g. when 850 is not available), as limited use of a
subset of the ISO character set. But, even when using CP 437, a PC should
use the same translation to ISO as for CP 850. Only 4 characters need to
translate differently and those needing them are expected to use CP 850.
The translation tables listed below are limited to these two codes
(others are found in a separate file).
A problem exists with the translation between CP 850 and ISO. As
published in the CDRA registry, the translation of the ASCII part is not
a null translation. This has simply been corrected below. But the IBM
translation also fails to provide round-trip integrity with the PC to
EBCDIC translation published and used by IBM products (specifically,
850->500 is not 850->ISO->500). So, this table may be subject to change,
unless IBM decides that the wrong table is the one between CECP 1047 and
ISO, or unless they say nothing and do not mind that they have set their
Communication Manager wrong. The change would only affect the range 80-9F
of the ISO control characters, though.
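The requirement that 850->500 give the same result as 850->ISO->500
can be checked mechanically once the three tables are available. The code
page 500 tables are not reproduced in this document, so the Python sketch
below uses toy tables (identity, with two code points swapped) purely to
show the method; compose and disagreements are hypothetical names.

```python
# Sketch only: toy tables stand in for the real 850, ISO and 500 tables.

def compose(first, second):
    """Return the table equivalent to applying first, then second."""
    # bytes.translate looks up every byte of first in second,
    # so the result at index i is second[first[i]].
    return first.translate(second)


def disagreements(direct, via_a, via_b):
    """List the code points where direct differs from via_a followed by via_b."""
    indirect = compose(via_a, via_b)
    return [code for code in range(256) if direct[code] != indirect[code]]


identity = bytes(range(256))
swapped = bytearray(identity)
swapped[0x80], swapped[0x81] = 0x81, 0x80      # toy inconsistency

print(disagreements(bytes(swapped), identity, identity))   # prints [128, 129]
```

An empty list means the direct table and the two-step path agree on all
256 code points.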
ISO 8859-1 to CECP 1047 (Extended de-facto EBCDIC):
00 01 02 03 37 2D 2E 2F 16 05 25 0B 0C 0D 0E 0F
10 11 12 13 3C 3D 32 26 18 19 3F 27 1C 1D 1E 1F
40 5A 7F 7B 5B 6C 50 7D 4D 5D 5C 4E 6B 60 4B 61
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 7A 5E 4C 7E 6E 6F
7C C1 C2 C3 C4 C5 C6 C7 C8 C9 D1 D2 D3 D4 D5 D6
D7 D8 D9 E2 E3 E4 E5 E6 E7 E8 E9 AD E0 BD 5F 6D
79 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96
97 98 99 A2 A3 A4 A5 A6 A7 A8 A9 C0 4F D0 A1 07
20 21 22 23 24 15 06 17 28 29 2A 2B 2C 09 0A 1B
30 31 1A 33 34 35 36 08 38 39 3A 3B 04 14 3E FF
41 AA 4A B1 9F B2 6A B5 BB B4 9A 8A B0 CA AF BC
90 8F EA FA BE A0 B6 B3 9D DA 9B 8B B7 B8 B9 AB
64 65 62 66 63 67 9E 68 74 71 72 73 78 75 76 77
AC 69 ED EE EB EF EC BF 80 FD FE FB FC BA AE 59
44 45 42 46 43 47 9C 48 54 51 52 53 58 55 56 57
8C 49 CD CE CB CF CC E1 70 DD DE DB DC 8D 8E DF
inverted,
CECP 1047 (Extended de-facto EBCDIC) to ISO 8859-1:
00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
10 11 12 13 9D 85 08 87 18 19 92 8F 1C 1D 1E 1F
80 81 82 83 84 0A 17 1B 88 89 8A 8B 8C 05 06 07
90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
20 A0 E2 E4 E0 E1 E3 E5 E7 F1 A2 2E 3C 28 2B 7C
26 E9 EA EB E8 ED EE EF EC DF 21 24 2A 29 3B 5E
2D 2F C2 C4 C0 C1 C3 C5 C7 D1 A6 2C 25 5F 3E 3F
F8 C9 CA CB C8 CD CE CF CC 60 3A 23 40 27 3D 22
D8 61 62 63 64 65 66 67 68 69 AB BB F0 FD FE B1
B0 6A 6B 6C 6D 6E 6F 70 71 72 AA BA E6 B8 C6 A4
B5 7E 73 74 75 76 77 78 79 7A A1 BF D0 5B DE AE
AC A3 A5 B7 A9 A7 B6 BC BD BE DD A8 AF 5D B4 D7
7B 41 42 43 44 45 46 47 48 49 AD F4 F6 F2 F3 F5
7D 4A 4B 4C 4D 4E 4F 50 51 52 B9 FB FC F9 FA FF
5C F7 53 54 55 56 57 58 59 5A B2 D4 D6 D2 D3 D5
30 31 32 33 34 35 36 37 38 39 B3 DB DC D9 DA 9F
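For practical use, the first table above can be carried into a program
as a single 256-byte string and the inverse derived from it rather than
typed separately. The following Python sketch does that; the table data
are copied from the ISO 8859-1 to CECP 1047 listing above, while the
names ISO_TO_1047, C1047_TO_ISO and invert are illustrative.

```python
# Sketch only: table data from the listing above, names are hypothetical.

ISO_TO_1047 = bytes.fromhex("".join("""
00 01 02 03 37 2D 2E 2F 16 05 25 0B 0C 0D 0E 0F
10 11 12 13 3C 3D 32 26 18 19 3F 27 1C 1D 1E 1F
40 5A 7F 7B 5B 6C 50 7D 4D 5D 5C 4E 6B 60 4B 61
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 7A 5E 4C 7E 6E 6F
7C C1 C2 C3 C4 C5 C6 C7 C8 C9 D1 D2 D3 D4 D5 D6
D7 D8 D9 E2 E3 E4 E5 E6 E7 E8 E9 AD E0 BD 5F 6D
79 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96
97 98 99 A2 A3 A4 A5 A6 A7 A8 A9 C0 4F D0 A1 07
20 21 22 23 24 15 06 17 28 29 2A 2B 2C 09 0A 1B
30 31 1A 33 34 35 36 08 38 39 3A 3B 04 14 3E FF
41 AA 4A B1 9F B2 6A B5 BB B4 9A 8A B0 CA AF BC
90 8F EA FA BE A0 B6 B3 9D DA 9B 8B B7 B8 B9 AB
64 65 62 66 63 67 9E 68 74 71 72 73 78 75 76 77
AC 69 ED EE EB EF EC BF 80 FD FE FB FC BA AE 59
44 45 42 46 43 47 9C 48 54 51 52 53 58 55 56 57
8C 49 CD CE CB CF CC E1 70 DD DE DB DC 8D 8E DF
""".split()))


def invert(table):
    """Return the inverse of a 256-entry table that uses each value once."""
    if sorted(table) != list(range(256)):
        raise ValueError("table is not a permutation of 00-FF")
    inverse = bytearray(256)
    for source, target in enumerate(table):
        inverse[target] = source
    return bytes(inverse)


C1047_TO_ISO = invert(ISO_TO_1047)

ebcdic = b"Li\xe8ge".translate(ISO_TO_1047)    # ISO 8859-1 text to EBCDIC
print(ebcdic.hex())                             # prints d389548785
print(ebcdic.translate(C1047_TO_ISO))           # prints b'Li\xe8ge'
```

Deriving one direction from the other means the pair cannot drift apart
when a site customizes a few code points.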
ISO 8859-1 to IBM PC code page 850:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
BA CD C9 BB C8 BC CC B9 CB CA CE DF DC DB FE F2
B3 C4 DA BF C0 D9 C3 B4 C2 C1 C5 B0 B1 B2 D5 9F
FF AD BD 9C CF BE DD F5 F9 B8 A6 AE AA F0 A9 EE
F8 F1 FD FC EF E6 F4 FA F7 FB A7 AF AC AB F3 A8
B7 B5 B6 C7 8E 8F 92 80 D4 90 D2 D3 DE D6 D7 D8
D1 A5 E3 E0 E2 E5 99 9E 9D EB E9 EA 9A ED E8 E1
85 A0 83 C6 84 86 91 87 8A 82 88 89 8D A1 8C 8B
D0 A4 95 A2 93 E4 94 F6 9B 97 A3 96 81 EC E7 98
inverted,
IBM PC code page 850 to ISO 8859-1:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
C7 FC E9 E2 E4 E0 E5 E7 EA EB E8 EF EE EC C4 C5
C9 E6 C6 F4 F6 F2 FB F9 FF D6 DC F8 A3 D8 D7 9F
E1 ED F3 FA F1 D1 AA BA BF AE AC BD BC A1 AB BB
9B 9C 9D 90 97 C1 C2 C0 A9 87 80 83 85 A2 A5 93
94 99 98 96 91 9A E3 C3 84 82 89 88 86 81 8A A4
F0 D0 CA CB C8 9E CD CE CF 95 92 8D 8C A6 CC 8B
D3 DF D4 D2 F5 D5 B5 FE DE DA DB D9 FD DD AF B4
AD B1 8F BE B6 A7 F7 B8 B0 A8 B7 B9 B3 B2 8E A0
Apple Macintosh
Apple Inc. did not respond to the request for an official
translation table between ISO 8859-1 and the Macintosh code that would
fulfill the data processing requirement of being invertible for the 256
code points. So, I built one and suggested that the Kermit repository
store the data and be the reference for it.
I made the translation as compatible as possible with an existing
translation table, the official "Apple File Exchange" (AFE) from Apple
Inc., which translates between the IBM PC code and Apple's and hence,
indirectly, to ISO 8859-1. Many characters of the Apple fonts belong to
ISO 8859-1 and caused no problem. The translation of some characters
became incompatible, because the "Apple File Exchange" is homographic,
which fails to be invertible (e.g. 2 superscript translates to plain 2),
and because the AFE is based on IBM PC 437, which contains some
characters of the Macintosh set that have been replaced (giving IBM PC
code page 850) with characters of ISO 8859-1 (for example, it matched Mac
Omega to a 437 Omega that became an 850 U circumflex that now has to
match the Mac's F3). Where translations remained arbitrary, homographic
or mnemonic choices were preferred. Leftovers from the 80-FF Mac range
have simply been lined up in the 80-9F range of ISO 8859-1 without any
particular reason.
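The difference between a homographic and an invertible table is easy
to test. The Python sketch below reports every target code produced by
more than one source; the toy table reproduces the superscript 2 case
mentioned above and is not the actual Apple File Exchange data. The
function names are hypothetical.

```python
# Sketch only: the toy table imitates one AFE-style homographic choice.

def collisions(table):
    """Map each target code produced more than once to its source codes."""
    sources = {}
    for source, target in enumerate(table):
        sources.setdefault(target, []).append(source)
    return {target: srcs for target, srcs in sources.items() if len(srcs) > 1}


def is_invertible(table):
    """True if the table has 256 entries and no two sources share a target."""
    return len(table) == 256 and not collisions(table)


homographic = bytearray(range(256))
homographic[0xB2] = 0x32            # toy: 2 superscript shown as plain 2

print(collisions(bytes(homographic)))     # prints {50: [50, 178]}
print(is_invertible(bytes(homographic)))  # prints False
```

Once two sources share a target, a file sent out and brought back can no
longer be restored byte for byte, which is the requirement the table in
this section was built to meet.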
This is a second version of the translation; 6 characters of the
standard Apple code whose translation was arbitrary have been translated
according to their Icelandic replacements (which also changed the
translation of the Apple code points to which these ISO characters
previously translated).
Below, you will find comments about the choices (why):
Blank: compatible with AFE (same in both PC 437 and 850).
S: not in 437/AFE, but ISO character is in "Standard Apple Character Set"
E: same for "SACS with extensions" (on newer systems only).
I: translation according to an Icelandic Apple font.
A: arbitrary (but choice sometimes guided by lookalike or mnemonic
aspects and a few characters of PC 437 will be preserved).
ISO Mac ISO 8859-1 name (IBM) Why Mac name (Paul Placeway)
80 | A5 | | A | bullet
81 | AA | | A | trade mark
82 | AD | | A | not equal
83 | B0 | | A | infinity
84 | B3 | | A | greater than or equal to
85 | B7 | | A | Uppercase Sigma (Summation)
86 | BA | | A | integral
87 | BD | | A | Uppercase Omega
88 | C3 | | A | radical (square root)
89 | C5 | | A | approx equal
8A | C9 | | A | ellipsis (...)
8B | D1 | | A | em dash
8C | D4 | | A | left singlequote ( ` )
8D | D9 | | A | Y dieresis
8E | DA | | A | divide (a / with less slope)
8F | B6 | | A | partial
90 | C6 | | A | Uppercase Delta
91 | CE | | A | OE
92 | E2 | | A | baseline single close quote
93 | E3 | | A | baseline double close quote
94 | E4 | | A | per thousand
95 | F0 | | A | (closed) Apple
96 | F6 | | A | circumflex
97 | F7 | | A | tilde
98 | F9 | | A | breve
99 | FA | | A | dot accent
9A | FB | | A | ring accent
9B | FD | | A | Hungarian umlaut
9C | FE | | A | ogonek
9D | FF | | A | caron
9E | F5 | | A | dotless i
9F | C4 | | A | florin
A0 | CA | required space | A | non-printing space
A1 | C1 | exclamation point inv | | inverted !
A2 | A2 | cent sign | S | cent
A3 | A3 | pound sign | | sterling
A4 | DB | int. currency symbol | E | generic currency
A5 | B4 | Yen sign | S | yen
A6 | CF | Vertical Line, Broken | A | oe
A7 | A4 | section/paragraph symb| S | section
A8 | AC | diaeresis,umlaut acc | S | dieresis (AKA umlaut)
A9 | A9 | Copyright sign | | copyright ( (C) )
AA | BB | ordinal indicator fem | | feminine ordinal
AB | C7 | left angle quotes | | left guillemot (like << )
AC | C2 | logical NOT, EOL symb | | logical not
AD | D0 | Syllable Hyphen | A | en dash
AE | A8 | Regist.Trade Mark sym | S | registered ( (R) )
AF | F8 | overline | A | macron
B0 | A1 | Degree Symbol | | superscript ring
B1 | B1 | plus or minus sign | | plus minus
B2 | D3 | 2 superscript | A | right doublequote ( '' )
B3 | D2 | 3 superscript | A | left doublequote ( `` )
B4 | AB | acute accent | S | acute accent
B5 | B5 | micro symbol | | greek lowercase mu
B6 | A6 | paragraph symbol USA | S | paragraph
B7 | E1 | Middle dot accent | E | centered (small) dot
B8 | FC | cedilla accent | E | cedilla
B9 | D5 | 1 superscript | A | right singlequote ( ' )
BA | BC | ordinal indicator mas | | masculine ordinal
BB | C8 | right angle quotes | | right guillemot (like >> )
BC | B9 | one quarter | A | lowercase pi
BD | B8 | one half | A | Uppercase Pi (Power)
BE | B2 | three quarters | A | less than or equal to
BF | C0 | Question mark inverted| | inverted ?
C0 | CB | A grave capital | S | A grave
C1 | E7 | A acute capital | E | A acute
C2 | E5 | A circumflex capital | E | A circumflex
C3 | CC | A tilde capital | S | A tilde
C4 | 80 | A diaeresis capital | | A dieresis
C5 | 81 | A overcircle capital | | A ring
C6 | AE | AE diphthong capital | | AE
C7 | 82 | C cedilla capital | | C cedilla
C8 | E9 | E grave capital | E | E grave
C9 | 83 | E acute capital | | E acute
CA | E6 | E circumflex capital | S | E circumflex
CB | E8 | E diaeresis capital | E | E dieresis
CC | ED | I grave capital | E | I grave
CD | EA | I acute capital | E | I acute
CE | EB | I circumflex capital | E | I circumflex
CF | EC | I diaeresis capital | E | I dieresis
D0 | DC | Eth Icelandic capital | I | < or Eth Icelandic capital
D1 | 84 | N tilde capital | | N tilde
D2 | F1 | O grave capital | E | O grave
D3 | EE | O acute capital | E | O acute
D4 | EF | O circumflex capital | E | O circumflex
D5 | CD | O tilde capital | S | O tilde
D6 | 85 | O diaeresis capital | | O dieresis
D7 | D7 | Multiply sign | A | lozenge (open diamond)
D8 | AF | O slash capital | E | O slash
D9 | F4 | U grave capital | E | U grave
DA | F2 | U acute capital | E | U acute
DB | F3 | U circumflex capital | E | U circumflex
DC | 86 | U diaeresis capital | | U dieresis
DD | A0 | Y acute Capital | I | dagger or Y acute Capital
DE | DE | Thorn Icelandic capital | I | fi or Thorn Icelandic capital
DF | A7 | sharp s small | | Es-sed (German double s)
E0 | 88 | a grave small | | a grave
E1 | 87 | a acute small | | a acute
E2 | 89 | a circumflex small | | a circumflex
E3 | 8B | a tilde small | S | a tilde
E4 | 8A | a diaeresis small | | a dieresis
E5 | 8C | a overcircle small | | a ring
E6 | BE | ae diphthong small | | ae
E7 | 8D | c cedilla small | | c cedilla
E8 | 8F | e grave small | | e grave
E9 | 8E | e acute small | | e acute
EA | 90 | e circumflex small | | e circumflex
EB | 91 | e diaeresis small | | e dieresis
EC | 93 | i grave small | | i grave
ED | 92 | i acute small | | i acute
EE | 94 | i circumflex small | | i circumflex
EF | 95 | i diaeresis small | | i dieresis
F0 | DD | Eth Icelandic small | I | > or Eth Icelandic small
F1 | 96 | n tilde small | | n tilde
F2 | 98 | o grave small | | o grave
F3 | 97 | o acute small | | o acute
F4 | 99 | o circumflex small | | o circumflex
F5 | 9B | o tilde small | S | o tilde
F6 | 9A | o diaeresis small | | o dieresis
F7 | D6 | Divide sign | | divide
F8 | BF | o slash small | S | o slash
F9 | 9D | u grave small | | u grave
FA | 9C | u acute small | | u acute
FB | 9E | u circumflex small | | u circumflex
FC | 9F | u diaeresis small | | u dieresis
FD | E0 | y acute small | I | double dagger or y acute small
FE | DF | Thorn Icelandic small | I | fl or Thorn Icelandic small
FF | D8 | y diaeresis small | | y dieresis
data 'taBL' (1001, "Translate In", purgeable) {
/* Translation from ISO 8859-1 to Macintosh extended code */
/* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF */
/*0x*/ $"00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F"
/*1x*/ $"10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F"
/*2x*/ $"20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F"
/*3x*/ $"30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F"
/*4x*/ $"40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F"
/*5x*/ $"50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F"
/*6x*/ $"60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F"
/*7x*/ $"70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F"
/*8x*/ $"A5 AA AD B0 B3 B7 BA BD C3 C5 C9 D1 D4 D9 DA B6"
/*9x*/ $"C6 CE E2 E3 E4 F0 F6 F7 F9 FA FB FD FE FF F5 C4"
/*Ax*/ $"CA C1 A2 A3 DB B4 CF A4 AC A9 BB C7 C2 D0 A8 F8"
/*Bx*/ $"A1 B1 D3 D2 AB B5 A6 E1 FC D5 BC C8 B9 B8 B2 C0"
/*Cx*/ $"CB E7 E5 CC 80 81 AE 82 E9 83 E6 E8 ED EA EB EC"
/*Dx*/ $"DC 84 F1 EE EF CD 85 D7 AF F4 F2 F3 86 A0 DE A7"
/*Ex*/ $"88 87 89 8B 8A 8C BE 8D 8F 8E 90 91 93 92 94 95"
/*Fx*/ $"DD 96 98 97 99 9B 9A D6 BF 9D 9C 9E 9F E0 DF D8"
};
data 'taBL' (1002, "Translate Out", purgeable) {
/* Translation from Macintosh extended code to ISO 8859-1 */
/* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xA xB xC xD xE xF */
/*0x*/ $"00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F"
/*1x*/ $"10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F"
/*2x*/ $"20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F"
/*3x*/ $"30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F"
/*4x*/ $"40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F"
/*5x*/ $"50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F"
/*6x*/ $"60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F"
/*7x*/ $"70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F"
/*8x*/ $"C4 C5 C7 C9 D1 D6 DC E1 E0 E2 E4 E3 E5 E7 E9 E8"
/*9x*/ $"EA EB ED EC EE EF F1 F3 F2 F4 F6 F5 FA F9 FB FC"
/*Ax*/ $"DD B0 A2 A3 A7 80 B6 DF AE A9 81 B4 A8 82 C6 D8"
/*Bx*/ $"83 B1 BE 84 A5 B5 8F 85 BD BC 86 AA BA 87 E6 F8"
/*Cx*/ $"BF A1 AC 88 9F 89 90 AB BB 8A A0 C0 C3 D5 91 A6"
/*Dx*/ $"AD 8B B3 B2 8C B9 F7 D7 FF 8D 8E A4 D0 F0 DE FE"
/*Ex*/ $"FD B7 92 93 94 C2 CA C1 CB C8 CD CE CF CC D3 D4"
/*Fx*/ $"95 D2 DA DB D9 9E 96 97 AF 98 99 9A B8 9B 9C 9D"
};
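Readers who want to reuse the two Rez listings above outside the
Macintosh development tools only need the hexadecimal strings. The
following Python sketch pulls them out of a data block; it assumes the
exact $"..." notation shown above and nothing else about the Rez
language, and the name rez_data_bytes is made up for the illustration.

```python
# Sketch only: handles the $"..." hex strings seen above, not full Rez syntax.
import re

HEX_STRING = re.compile(r'\$"([0-9A-Fa-f ]*)"')


def rez_data_bytes(source):
    """Concatenate the bytes of every $"..." hex string found in source."""
    return b"".join(bytes.fromhex(chunk.replace(" ", ""))
                    for chunk in HEX_STRING.findall(source))


sample = '/*8x*/ $"A5 AA AD B0"\n/*9x*/ $"C6 CE"\n'
print(rez_data_bytes(sample).hex())    # prints a5aaadb0c6ce
```

Applied to one complete data block, the result should be 256 bytes long;
checking that length is a cheap guard against a line lost in transit.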
ISO 8859-1
Here is a list of names and a graphic representation of the ISO 8859-1
code. The well-known ASCII part and control characters have been left out
to shorten the text. The lists are included as practical programming help
only. In particular, the "bitmaps" are not official in any way. For
convenience, two lists of names and acronyms are given: the first comes
from IBM, the second from a list of characters of the standard ISO 6937.
Code point in hexadecimal / Acronym / Name. Origin: IBM.
A0 | SP30 | required space D0 | LD62 | Eth Icelandic capital
A1 | SP03 | exclamation point inv D1 | LN20 | N tilde capital
A2 | SC04 | cent sign D2 | LO14 | O grave capital
A3 | SC02 | pound sign D3 | LO12 | O acute capital
A4 | SC01 | int. currency symbol D4 | LO16 | O circumflex capital
A5 | SC05 | Yen sign D5 | LO20 | O tilde capital
A6 | SM65 | Vertical Line, Broken D6 | LO18 | O diaeresis capital
A7 | SM24 | section/paragraph symb D7 | SA07 | Multiply sign
A8 | SD17 | diaeresis,umlaut acc D8 | LO62 | O slash capital
A9 | SM52 | Copyright sign D9 | LU14 | U grave capital
AA | SM21 | ordinal indicator fem DA | LU12 | U acute capital
AB | SP17 | left angle quotes DB | LU16 | U circumflex capital
AC | SM66 | logical NOT, EOL symb DC | LU18 | U diaeresis capital
AD | SP32 | Syllable Hyphen DD | LY12 | Y acute Capital
AE | SM53 | Regist.Trade Mark sym DE | LT64 | Thorn Icelandic capital
AF | SM15 | overline DF | LS61 | sharp s small
B0 | SM19 | Degree Symbol E0 | LA13 | a grave small
B1 | SA02 | plus or minus sign E1 | LA11 | a acute small
B2 | ND021| 2 superscript E2 | LA15 | a circumflex small
B3 | ND031| 3 superscript E3 | LA19 | a tilde small
B4 | SD11 | acute accent E4 | LA17 | a diaeresis small
B5 | SM17 | micro symbol E5 | LA27 | a overcircle small
B6 | SM25 | paragraph symbol USA E6 | LA51 | ae diphthong small
B7 | SD63 | Middle dot accent E7 | LC41 | c cedilla small
B8 | SD41 | cedilla accent E8 | LE13 | e grave small
B9 | ND011| 1 superscript E9 | LE11 | e acute small
BA | SM20 | ordinal indicator mas EA | LE15 | e circumflex small
BB | SP18 | right angle quotes EB | LE17 | e diaeresis small
BC | NF04 | one quarter EC | LI13 | i grave small
BD | NF01 | one half ED | LI11 | i acute small
BE | NF05 | three quarters EE | LI15 | i circumflex small
BF | SP16 | Question mark inverted EF | LI17 | i diaeresis small
C0 | LA14 | A grave capital F0 | LD63 | Eth Icelandic small
C1 | LA12 | A acute capital F1 | LN19 | n tilde small
C2 | LA16 | A circumflex capital F2 | LO13 | o grave small
C3 | LA20 | A tilde capital F3 | LO11 | o acute small
C4 | LA18 | A diaeresis capital F4 | LO15 | o circumflex small
C5 | LA28 | A overcircle capital F5 | LO19 | o tilde small
C6 | LA52 | AE diphthong capital F6 | LO17 | o diaeresis small
C7 | LC42 | C cedilla capital F7 | SA06 | Divide sign
C8 | LE14 | E grave capital F8 | LO61 | o slash small
C9 | LE12 | E acute capital F9 | LU13 | u grave small
CA | LE16 | E circumflex capital FA | LU11 | u acute small
CB | LE18 | E diaeresis capital FB | LU15 | u circumflex small
CC | LI14 | I grave capital FC | LU17 | u diaeresis small
CD | LI12 | I acute capital FD | LY11 | y acute small
CE | LI16 | I circumflex capital FE | LT63 | Thorn Icelandic small
CF | LI18 | I diaeresis capital FF | LY17 | y diaeresis small
Names and slightly different acronyms from the ISO 6937 repertoire
A0 SP31 NO-BREAK SPACE
A1 SP03 INVERTED EXCLAMATION MARK
A2 SC04 CENT SIGN
A3 SC02 POUND SIGN
A4 SC01 CURRENCY SIGN
A5 SC05 YEN SIGN
A6 SM65 BROKEN BAR
A7 SM24 PARAGRAPH SIGN
A8 SD17 DIAERESIS
A9 SM52 COPYRIGHT SIGN
AA SM21 FEMININE ORDINAL INDICATOR
AB SP17 LEFT POINTING DOUBLE ANGLE QUOTATION MARK
AC SM66 NOT SIGN
AD SP32 SOFT HYPHEN
AE SM53 REGISTERED TRADE MARK SIGN
AF SD31 MACRON
B0 SM19 DEGREE SIGN
B1 SA02 PLUS-MINUS SIGN
B2 NS02 SUPERSCRIPT TWO
B3 NS03 SUPERSCRIPT THREE
B4 SD11 ACUTE ACCENT
B5 SM17 MICRO SIGN
B6 SM25 PILCROW SIGN
B7 SM26 MIDDLE DOT
B8 SD41 CEDILLA
B9 NS01 SUPERSCRIPT ONE
BA SM20 MASCULINE ORDINAL INDICATOR
BB SP18 RIGHT POINTING DOUBLE ANGLE QUOTATION MARK
BC NF04 VULGAR FRACTION ONE-QUARTER
BD NF01 VULGAR FRACTION ONE-HALF
BE NF05 VULGAR FRACTION THREE-QUARTERS
BF SP16 INVERTED QUESTION MARK
C0 LA14 LATIN CAPITAL LETTER A WITH GRAVE ACCENT
C1 LA12 LATIN CAPITAL LETTER A WITH ACUTE ACCENT
C2 LA16 LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
C3 LA20 LATIN CAPITAL LETTER A WITH TILDE
C4 LA18 LATIN CAPITAL LETTER A WITH DIAERESIS
C5 LA28 LATIN CAPITAL LETTER A WITH RING ABOVE
C6 LA52 LATIN CAPITAL LIGATURE AE
C7 LC42 LATIN CAPITAL LETTER C WITH CEDILLA
C8 LE14 LATIN CAPITAL LETTER E WITH GRAVE ACCENT
C9 LE12 LATIN CAPITAL LETTER E WITH ACUTE ACCENT
CA LE16 LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
CB LE18 LATIN CAPITAL LETTER E WITH DIAERESIS
CC LI14 LATIN CAPITAL LETTER I WITH GRAVE ACCENT
CD LI12 LATIN CAPITAL LETTER I WITH ACUTE ACCENT
CE LI16 LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
CF LI18 LATIN CAPITAL LETTER I WITH DIAERESIS
D0 LD62 LATIN CAPITAL LETTER D WITH STROKE
D1 LN20 LATIN CAPITAL LETTER N WITH TILDE
D2 LO14 LATIN CAPITAL LETTER O WITH GRAVE ACCENT
D3 LO12 LATIN CAPITAL LETTER O WITH ACUTE ACCENT
D4 LO16 LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
D5 LO20 LATIN CAPITAL LETTER O WITH TILDE
D6 LO18 LATIN CAPITAL LETTER O WITH DIAERESIS
D7 SA07 MULTIPLICATION SIGN
D8 LO62 LATIN CAPITAL LETTER O WITH OBLIQUE STROKE
D9 LU14 LATIN CAPITAL LETTER U WITH GRAVE ACCENT
DA LU12 LATIN CAPITAL LETTER U WITH ACUTE ACCENT
DB LU16 LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
DC LU18 LATIN CAPITAL LETTER U WITH DIAERESIS
DD LY12 LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
DE LT64 LATIN CAPITAL LETTER ICELANDIC THORN
DF LS61 LATIN SMALL LETTER GERMAN SHARP S
E0 LA13 LATIN SMALL LETTER A WITH GRAVE ACCENT
E1 LA11 LATIN SMALL LETTER A WITH ACUTE ACCENT
E2 LA15 LATIN SMALL LETTER A WITH CIRCUMFLEX ACCENT
E3 LA19 LATIN SMALL LETTER A WITH TILDE
E4 LA17 LATIN SMALL LETTER A WITH DIAERESIS
E5 LA27 LATIN SMALL LETTER A WITH RING ABOVE
E6 LA51 LATIN SMALL LIGATURE AE
E7 LC41 LATIN SMALL LETTER C WITH CEDILLA
E8 LE13 LATIN SMALL LETTER E WITH GRAVE ACCENT
E9 LE11 LATIN SMALL LETTER E WITH ACUTE ACCENT
EA LE15 LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT
EB LE17 LATIN SMALL LETTER E WITH DIAERESIS
EC LI13 LATIN SMALL LETTER I WITH GRAVE ACCENT
ED LI11 LATIN SMALL LETTER I WITH ACUTE ACCENT
EE LI15 LATIN SMALL LETTER I WITH CIRCUMFLEX ACCENT
EF LI17 LATIN SMALL LETTER I WITH DIAERESIS
F0 LD63 LATIN SMALL LETTER ICELANDIC ETH
F1 LN19 LATIN SMALL LETTER N WITH TILDE
F2 LO13 LATIN SMALL LETTER O WITH GRAVE ACCENT
F3 LO11 LATIN SMALL LETTER O WITH ACUTE ACCENT
F4 LO15 LATIN SMALL LETTER O WITH CIRCUMFLEX ACCENT
F5 LO19 LATIN SMALL LETTER O WITH TILDE
F6 LO17 LATIN SMALL LETTER O WITH DIAERESIS
F7 SA06 DIVISION SIGN
F8 LO61 LATIN SMALL LETTER O WITH OBLIQUE STROKE
F9 LU13 LATIN SMALL LETTER U WITH GRAVE ACCENT
FA LU11 LATIN SMALL LETTER U WITH ACUTE ACCENT
FB LU15 LATIN SMALL LETTER U WITH CIRCUMFLEX ACCENT
FC LU17 LATIN SMALL LETTER U WITH DIAERESIS
FD LY11 LATIN SMALL LETTER Y WITH ACUTE ACCENT
FE LT63 LATIN SMALL LETTER ICELANDIC THORN
FF LY17 LATIN SMALL LETTER Y WITH DIAERESIS
ISO 8859-1 by [coarse, bandwidth saving] pictures
-------------------------------------------------------------------------
| A0 | A1 | A2 | A3 | A4 | A5 | A6 | A7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| | XX | XX | XXX | | XX XX | XX | XXXXX |
| | | XX | XX XX |XX XX | XX XX | XX | XX X|
| | XX | XXXXXX | XX X | XXXXX | XXXX | XX | XXXX |
| | XX |XX |XXXX |XX XX | XXXXXX | | XX XX |
| | XXXX |XX | XX |XX XX | XX | | XX XX |
| | XXXX | XXXXXX | XX XX | XXXXX | XXXXXX | XX | XXXX |
| | XX | XX |XXXXXX |XX XX | XX | XX |X XX |
| | | XX | | | XX | XX | XXXXX |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| A8 | A9 | AA | AB | AC | AD | AE | AF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| | XXXXXX | XXXX | | | | XXXXXX |XXXXXXXX|
|XX XX |X X| XX XX | XX XX| | |X X| |
| |X XXX X| XX XX | XX XX | | |X XXX X| |
| |X X X| XXXXX |XX XX |XXXXXXX | XXXXXX |X X X X| |
| |X X X| | XX XX | XX | |X XXX X| |
| |X XXX X| XXXXXX | XX XX| XX | |X X X X| |
| |X X| | | | |X X| |
| | XXXXXX | | | | | XXXXXX | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| B0 | B1 | B2 | B3 | B4 | B5 | B6 | B7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XXX | XX | XXXX | XXXX | XX | | XXXXXXX| |
| XX XX | XX | XX | XX | XX | |XX XX XX| |
| XX XX | XXXXXX | XX | XXX | XX | XX XX |XX XX XX| |
| XXX | XX | XX | XX | | XX XX | XXXX XX| XX |
| | XX | XXXXX | XXXX | | XX XX | XX XX| |
| | | | | | XX XX | XX XX| |
| | XXXXXX | | | | XXXXX | XX XX| |
| | | | | |XX | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| B8 | B9 | BA | BB | BC | BD | BE | BF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| | XX | XXX | | XX XX| XX XX|XXX X| XX |
| | XXX | XX XX |XX XX |XXX XX |XXX XX | XX X | |
| | XX | XX XX | XX XX | XX XX | XX XX |XXX X | XX |
| | XX | XXX | XX XX| XXXX X | XXXXXX | XXX X | XX |
| | XXXX | | XX XX | XX XX | XX XX|XXXX XX | XX |
| XX | | XXXXX |XX XX | XX X X | XX XX | X X X | XX XX|
| XX | | | |XX XXXXX|XX XX | X XXXXX| XXXXX |
| XXX | | | | XX | XXXX|X XX | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| C0 | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XX | XX | XXXXX | XXX XX |XX XX | XXX | XXXXX | XXXXX |
| XX | XX |X X |XX XXX | XXX | XX XX | XX XX |XX XX |
| XXX | XXX | XXX | XXX | XX XX | XXXXX |XX XX |XX |
| XX XX | XX XX | XX XX | XX XX |XX XX |XX XX |XXXXXXX |XX |
|XX XX |XX XX |XX XX |XX XX |XXXXXXX |XXXXXXX |XX XX |XX XX |
|XXXXXXX |XXXXXXX |XXXXXXX |XXXXXXX |XX XX |XX XX |XX XX | XXXXX |
|XX XX |XX XX |XX XX |XX XX |XX XX |XX XX |XX XXX | XX |
| | | | | | | | XXXX |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| C8 | C9 | CA | CB | CC | CD | CE | CF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XX | XX | XXXXX |XX XX | XX | XX | XXXX | XX XX |
| XX | XX |X X | | XX | XX | X X | |
|XXXXXXX |XXXXXXX |XXXXXXX |XXXXXXX | XXXX | XXXX | XXXX | XXXX |
|XX |XX |XX |XX | XX | XX | XX | XX |
|XXXXXX |XXXXX |XXXXXX |XXXXXX | XX | XX | XX | XX |
|XX |XX |XX |XX | XX | XX | XX | XX |
|XXXXXXX |XXXXXXX |XXXXXXX |XXXXXXX | XXXX | XXXX | XXXX | XXXX |
| | | | | | | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
|XXXXX | XXX XX | XX | XX | XXXXX | XXX XX |XX XX | |
| XX XX |XX XXX | XX | XX |X X |XX XXX | XXX |XX XX |
| XX XX | | XXX | XXX | XXX | XXX | XX XX | XX XX |
|XXXX XX |XXX XX | XX XX | XX XX | XX XX | XX XX |XX XX | XXX |
| XX XX |XXXX XX |XX XX |XX XX |XX XX |XX XX |XX XX | XX XX |
| XX XX |XX XXXX | XX XX | XX XX | XX XX | XX XX | XX XX |XX XX |
|XXXXX |XX XXX | XXX | XXX | XXX | XXX | XXX | |
| | | | | | | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| D8 | D9 | DA | DB | DC | DD | DE | DF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XXX X | XX | XX | XXXXX |XX XX | XX |XXXX | XXXX |
| XX XX | XX | XX |X X | | XX | XX |XX XX |
|XX XXX |XX XX |XX XX | |XX XX | XX XX | XXXXX |XX XX |
|XX X XX |XX XX |XX XX |XX XX |XX XX | XX XX | XX XX |XX XX |
|XXX XX |XX XX |XX XX |XX XX |XX XX | XXXX | XXXXX |XX XX |
| XX XX |XX XX |XX XX |XX XX |XX XX | XX | XX |XX XX |
|X XXX | XXXXX | XXXXX | XXXXX | XXXXX | XXXX |XXXX |XX XX |
| | | | | | | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| E0 | E1 | E2 | E3 | E4 | E5 | E6 | E7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XX | XX | XXXXX | XXX XX |XX XX | XX | | |
| XX | XX |X X |XX XXX | | XX | | |
| XXXX | XXXX | XXXX | XXXXX | XXXX | XXXX | XXXXXX | XXXXXX |
| XX | XX | XX | XX | XX | XX | X X |XX |
| XXXXX | XXXXX | XXXXX | XXXXXX | XXXXX | XXXXX |XXXXXXX |XX |
|XX XX |XX XX |XX XX |XX XX |XX XX |XX XX |X X | XXXXXX |
| XXX XX | XXX XX | XXX XX | XXXXXX | XXX XX | XXX XX |XXXXXXX | XX |
| | | | | | | | XXX |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| E8 | E9 | EA | EB | EC | ED | EE | EF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XX | XX | XXXXX |XX XX | XX | XX | XXXXX | XX XX |
| XX | XX |X X | | XX | XX |X X | |
| XXXXX | XXXXX | XXXXX | XXXXX | | | XXX | XXX |
|XX XX |XX XX |XX XX |XX XX | XXX | XXX | XX | XX |
|XXXXXXX |XXXXXXX |XXXXXXX |XXXXXXX | XX | XX | XX | XX |
|XX |XX |XX |XX | XX | XX | XX | XX |
| XXXXX | XXXXX | XXXXX | XXXXX | XXXX | XXXX | XXXX | XXXX |
| | | | | | | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| XX | XXX XX | XX | XX | XXXXX | XXX XX |XX XX | |
| XXXXXX |XX XXX | XX | XX |X X |XX XXX | | XX |
| XX | | XXXXX | XXXXX | XXXXX | XXXXX | XXXXX | |
| XXXXX |XX XXX |XX XX |XX XX |XX XX |XX XX |XX XX | XXXXXX |
|XX XX | XX XX |XX XX |XX XX |XX XX |XX XX |XX XX | |
|XX XX | XX XX |XX XX |XX XX |XX XX |XX XX |XX XX | XX |
| XXXX | XX XX | XXXXX | XXXXX | XXXXX | XXXXX | XXXXX | |
| | | | | | | | |
-------------------------------------------------------------------------
-------------------------------------------------------------------------
| F8 | F9 | FA | FB | FC | FD | FE | FF |
|--------|--------|--------|--------|--------|--------|--------|--------|
| | XX | XX | XXXX |XX XX | XX |XXX |XX XX |
| X | XX | XX |X X | | XX | XX | |
| XXXXX |XX XX |XX XX | |XX XX |XX XX | XXXXX |XX XX |
|XX XXX |XX XX |XX XX |XX XX |XX XX |XX XX | XX XX |XX XX |
|XX X XX |XX XX |XX XX |XX XX |XX XX |XX XX | XX XX |XX XX |
|XXX XX |XX XX |XX XX |XX XX |XX XX | XXXXXX | XXXXX | XXXXXX |
| XXXXX | XXX XX | XXX XX | XXX XX | XXX XX | XX | XX | XX |
|X | | | | |XXXXXX |XXXX |XXXXXX |
-------------------------------------------------------------------------
Andr'e PIRARD
SEGI Univ. de Li`ege
B26 - Sart Tilman
B-4000 Li`ege 1 (Belgium)
PIRARD@BLIULG11 on EARN alias BITNET
pirard@vm1.ulg.ac.be on Internet